Some classic machine translation (MT) evaluation methods, such as the
bilingual evaluation understudy (BLEU) score, have notably underperformed
in evaluating machine translations for morphologically rich languages like
Arabic. However, recent advances in word vectors and sentence vectors have
opened up new research avenues for low-resource languages. This paper
proposes a novel linguistic-based evaluation method for English sentences
translated into Arabic. The proposed approach includes penalties based on
length and position, together with context-based schemes such as
part-of-speech (POS) tagging and multilingual sentence-BERT (SBERT) models
for machine translation evaluation. The proposed technique is tested using
Pearson correlation as the performance evaluation parameter and compared
with state-of-the-art techniques. The experimental results demonstrate that
the proposed model outperforms other MT evaluation methods such as BLEU.
1. INTRODUCTION

In today's technology-driven landscape, state-of-the-art approaches to machine translation (MT) are
imperative. Evaluating the quality of text translated by machine translation systems is one of the prime
concerns. There is no denying that the assessment of machine translations by humans is a time-consuming and 
expensive practice [1]. Moreover, considering the rising trend of machine-based translation, tools to automate 
MT evaluation are of utmost significance for researchers in natural language processing (NLP) and specifically 
for MT research. Translation of text has been done traditionally by human translators, which is a very costly, 
time-consuming and biased process. MT systems, however, gained more prominence with the advent of new
NLP methods in linguistic evaluation and improvements in MT systems [2]. The expenditure of substantial resources
in the form of time and money on the research, design, and implementation of an MT system necessitates the 
meticulous evaluation of the entire MT development cycle. It is essential to identify and analyze the possible 
source of errors and explore potential approaches for addressing these concerns. These approaches must be 
deployed and tested to observe whether the identified errors lessen without compromising the system 
performance. The mechanism is accepted only if the system performance is not degraded;
otherwise, an alternative mechanism is devised [3]. In this work, a novel system for evaluating English-to-
Arabic translation is presented. This system uses multiple measures, including part-of-speech (POS) and
sentence-BERT (SBERT), to ensure the accuracy needed for the translation evaluation of rich and complex
languages like Arabic. With the help of Arabic linguistic experts, the presented method is compared with
state-of-the-art MT evaluation systems and is shown to be more accurate.

The rest of the paper is arranged as follows: Section 2 covers recent work on machine translation evaluation.
Section 3 presents the proposed approach and methodology, while Section 4 describes BERT-based sentence
similarity computation. Section 5 discusses the experimental setup and evaluation results of the proposed Arabic
machine translation evaluation method. Lastly, Section 6 concludes the paper and identifies potential directions
for extending this work in the future.


2. RELATED WORK 

Human evaluation of machine translation systems is highly subjective, time-consuming, and cannot 
be reused [4]. Recently, automatic evaluation methods have gained substantial attention because they avoid
these costs. The bilingual evaluation understudy (BLEU) score [5] is one of the most well-known and
widely employed algorithms for evaluating the quality of machine-translated text. It combines n-gram
precision between the candidate and reference translations with a brevity penalty. For example, it counts the
word n-grams of a candidate sentence that also appear in the reference sentence, regardless of their
position. Later, the BLEU metric was modified to produce the NIST metric, which weights n-grams by
their information level. Besides that, in 2006, Turian et al. [6] proposed the general text matcher (GTM),
based on accuracy measures such as precision, recall, and F-measure. Another package, named ROUGE, was
presented in 2003 for the automatic evaluation of summaries, wherein a computer-generated summary
(candidate summary) is evaluated against a human-created summary (reference summary) [7]. For example,
machine-generated summaries of a set of paragraphs are compared with human summaries of the same
paragraphs, and n-gram overlap scores are computed to check the performance.
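
The two ingredients of BLEU mentioned above, clipped n-gram precision and the brevity penalty, can be illustrated with a minimal Python sketch. It assumes a single reference and whitespace-tokenized input; the function names are illustrative and it does not reproduce the full BLEU score (no geometric mean over n-gram orders, no corpus-level aggregation).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate_len, reference_len):
    """Penalize candidates that are shorter than the reference; longer ones get 1.0."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)
```

Because matching ignores where an n-gram occurs, a reordered candidate can keep a high unigram precision, and the brevity penalty never drops below 1.0 for candidates longer than the reference.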

In another relevant study, different human judgments are explored with the human-mediated translation
edit rate (HTER) [8]. HTER is a semi-automatic measure in which humans do not score the MT output but
instead generate a new reference translation that is closer to the MT output while retaining the fluency and
adequacy of the original reference translation [8]. Following that, Callison-Burch et al. [9] attempted to
correlate automatic evaluation metrics with human judgments. They examined nine different automatic
evaluation metrics [9] and, with the help of comprehensive human evaluation, ranked them by how well they
identify the system that produces the highest-quality translation. They also described the two categorical
scales currently used by human evaluators to rate the fluency and adequacy of MT output.

In 2005, Banerjee and Lavie [10] introduced METEOR, an automatic method for MT
evaluation that creates a word alignment between two sentences, i.e., the candidate translation string and
the reference translation string. The alignment is built through word mappings such as i) stem matching,
ii) exact matching, and iii) synonym matching. In recent years, the METEOR metric was extended to the
phrase level in the METEOR NEXT metric [11]. Previously, METEOR needed human judgments in the
target language. However, METEOR Universal learns function word lists and paraphrase tables to provide
language-specific evaluations. Furthermore, METEOR Universal showed improved performance for Hindi and
Russian since it uses a universal parameter set learned through pooling [12].
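
The staged word mapping used for alignment can be sketched as follows. This is only an illustration of the idea, not the METEOR implementation: `toy_stem` is a hypothetical stand-in for a real stemmer, the synonym table is supplied by the caller, and applying the stages in the order exact, stem, synonym is an assumption of this sketch.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def staged_alignment(candidate, reference, synonyms):
    """Greedily align candidate words to reference words, one matching stage at a time."""
    matchers = [
        ("exact", lambda c, r: c == r),
        ("stem", lambda c, r: toy_stem(c) == toy_stem(r)),
        ("synonym", lambda c, r: r in synonyms.get(c, set()) or c in synonyms.get(r, set())),
    ]
    alignment = []
    used_cand, used_ref = set(), set()
    for stage, matches in matchers:
        for i, cand_word in enumerate(candidate):
            if i in used_cand:
                continue
            for j, ref_word in enumerate(reference):
                if j in used_ref:
                    continue
                if matches(cand_word, ref_word):
                    # Record (candidate index, reference index, stage that produced the match).
                    alignment.append((i, j, stage))
                    used_cand.add(i)
                    used_ref.add(j)
                    break
    return alignment
```

Words left unaligned after all three stages (for instance function words with no counterpart) simply do not appear in the returned list.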

Several attempts, with reference translations [13]-[18] and without reference translations [16]-[19],
have been made over the years to apply machine learning (ML) techniques to MT evaluation. Corston et al. [16]
employed decision trees to evaluate the well-formedness of the MT output and built classifiers that learn to
distinguish human translations from MT. In contrast, Akiba et al. [20] approached this as a multi-class
classification task and trained decision-tree classifiers on multiple edit distance features that include
lexical, morpho-syntactic, and lexical-semantic information. Moreover, Kulesza and Shieber [17] trained a
support vector machine (SVM) for the same purpose.

Quirk [19] argued that human references are not mandatory for MT evaluation. The study applied
several supervised algorithms, including decision trees, SVMs, and linear regression. All these statistical
techniques worked well, but linear regression performed particularly well. Following that, Russo-Lassner
et al. [21] developed a linear regression model that used stemming, WordNet synonymy, verb class
synonymy, noun phrase head matching, and proper name matching. Some other studies [13]-[15] also
suggested linear regression-based models for metric combination that outperformed most of the existing
approaches. Gamon et al. [22] introduced a new approach that involved training on a large corpus of
domain-specific data instead of modeling the output on a target language. They also added perplexity scores to
improve the sentence-level language model. Ye et al. [23] and Duh [24] approached sentence-level MT
evaluation as a ranking problem. They used n-grams, dependency, and translation perplexity of the reference
language model (LM) as features for a ranking SVM algorithm. Gautam and Bhattacharyya [25] proposed a
layered approach for MT evaluation based on lexical, syntactic, and semantic layers. They used BLEU as a
baseline metric for the lexical layer, while the Hamming score, Kendall tau distance score, and Spearman rank
score were considered for the syntactic layer.

In [26], the authors explored language divergences and ambiguities in English to Arabic machine 
translation. Keeping different features of Arabic in mind, Abu-Ayyash [27] worked on errors and non-errors 
made by MT systems. He also investigated the extent to which MT systems can deliver when the language has 
different rules and grammatical representations. A comparative study between various MT systems using BLEU 
and METEOR was done in [28]. It aimed at identifying the most suitable metric for the Arabic to English 
translation system, which helps the developers enhance the effectiveness of these systems. The authors examined 
the translation accuracy of two known machine translation programs, Google Translator and Babylon, translating
the same Arabic text into English. These metrics were used to measure the quality of MT and determine the
scheme closest to human ratings. They reported that BLEU agreed best with human rating judgments.

Moreover, different approaches for Arabic-to-English translation are reviewed in [29], which showed
that neural machine translation (NMT) approaches are more accurate than the alternatives.
Furthermore, the emerging attention-based approach is found to be remarkably effective at improving NMT's
performance for all languages. Guzman et al. [30] presented Kendall's τ scores obtained from five n-gram-based
metrics and reported findings from training neural networks with embeddings of different representations as
input. They added different lexical and morpho-syntactic features of languages and compared the performance
of BLEU, NIST, METEOR, and 1-TER, along with AL-BLEU. They observed that NIST and METEOR
obtained approximately the same performance for this task. In another study, Shimanaka et al. [31] explored the
utility of universal sentence representations for measuring machine translation quality, since training sentence
representations using a small translation dataset is a challenging task. They also introduced the regressor using
sentence embeddings (RUSE) metric at the 2018 workshop on machine translation (WMT18). RUSE
uses sentence embeddings and can capture global information that n-gram-based models fail to capture. It has
also been concluded that universal sentence embeddings trained on a limited or small in-domain dataset are less
effective than the ones trained on a large-scale dataset.

During the survey, several other evaluation metrics were investigated, and many of them were
found to have released upgraded versions. Moreover, some of the existing metrics correlate well with
manual evaluations, but not all of them do. They do not perform well for all languages, particularly
the morphologically rich ones. The summary of the literature review is given in Table 1.


Table 1. Summary of literature review 


Study | Technique | Limitations | Languages
[5] | BLEU | It does not handle morphologically rich languages well and does not consider sentence structure directly. | Language independent
[6] | GTM | The target language is overlooked since the significant focus stays on the mother tongue. | Chinese to English; Arabic to English
[7] | ROUGE | It does not provide a conclusive understanding of how well summaries perform in comparison with human summaries. | Arabic to English
[8] | HTER | It is a strictly quantitative metric and weights all errors equally. | Arabic to English
[32] | Correlation between automated evaluation and human judgement | A few investigations have shown a poor correlation between annotators utilizing this strategy. | Hungarian to English; German to English; Spanish to English; Czech to English
[11] | METEOR | Requires lots of human effort for translation. | Arabic to English; Chinese to English
[19] | Linguistic model | Evaluation metrics were computationally expensive and could only be used if only a few different hypotheses needed to be tested. | Spanish to English
[21] | Paraphrase-based linear regression model | Inability to understand local slang. | Arabic to English
[22] | SVM-based linguistic model | It limits the amount of data that can be processed. | French to English
[13]-[15] | Regression-based learning model | Can lead to erroneous and misleading results. | Chinese to English; Arabic to English
[23] | SVM learning algorithm | Limitation on the quantity of data. | Chinese to English
[25] | NLP layers based | High computational cost. | German to English; Spanish to English; French to English
[27] | PNMT based | It has grammatical mistakes and no quality control. | Arabic to English
[26] | ANN and rule-based MT system | It demands a tremendous amount of time and linguistic resources. | Arabic to English
[29] | Google translation based MT system | Involves the use of creative linguistic tools. | Arabic to English; English to Arabic
[30] | Neural network based MT evaluation model | The source-text sentences must be evident and cohesive. | Arabic to English

3. THE PROPOSED METHOD

The main drawbacks of manual evaluation include resource consumption, time
utilization, and task recurrence [33]. On the other hand, automatic evaluation has many advantages, but it
performs better for some languages than for others, primarily when English is used as the target
language [5]. This can be attributed to the rich data resources available for English and the resources that
can aid evaluation, such as dictionaries and thesauruses. However, it performs poorly
when English is the source language and the target is a low-resource language, such as Arabic or Urdu.
One of the main reasons is the lack of appropriate data for evaluation. A few metrics use several linguistic
features that make it harder to generalize them to other languages. Moreover, other metrics, for instance
BLEU [5], utilize context-independent features through an n-gram precision score. Some researchers believe
that a high BLEU score does not necessarily indicate a better translation [5]. In this study, a novel MT
evaluation method for the Arabic language, inspired by [1], is proposed. Arabic is a morphologically dynamic
language because of its word-to-word syntactical independence [34]. Even though Arabic is widely spoken,
it is still considered a low-resource language due to the unavailability of adequate Arabic
data resources. Metrics like BLEU cannot be used directly for languages like Arabic because the structure of
Arabic script is different from that of English and other European languages. The widely used BLEU metric
implements a brevity penalty [35] for short sentences; however, overly long sentences are not penalized
properly. To overcome this issue, a sentence length penalty (LP) factor is introduced that penalizes candidate
sentences that are shorter or longer than the reference translation. There are three cases:
a) When the candidate sentence (translation) length is the same as the reference sentence, then there is no
penalty, and LP is one.

b) When the candidate sentence length is less than the reference sentence, the penalty is computed using (1):

$$LP = \exp\left(1 - \frac{r}{c}\right), \quad c < r \tag{1}$$

c) When the length of the candidate sentence is greater than the reference sentence, the penalty is
computed using (2):

$$LP = \exp\left(1 - \frac{c}{r}\right), \quad c > r \tag{2}$$

where c and r represent the lengths of the candidate sentence and the reference sentence, respectively.
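
The three cases of the length penalty can be written as one small function. The sketch below is a direct transcription of (1) and (2); the function name and the choice to measure length in tokens are assumptions made for illustration.

```python
import math


def length_penalty(candidate_len, reference_len):
    """Length penalty LP: 1 for equal lengths, below 1 for shorter or longer candidates."""
    if candidate_len <= 0 or reference_len <= 0:
        raise ValueError("sentence lengths must be positive")
    c, r = candidate_len, reference_len
    if c == r:
        return 1.0
    if c < r:
        return math.exp(1.0 - r / c)  # equation (1): candidate shorter than reference
    return math.exp(1.0 - c / r)      # equation (2): candidate longer than reference
```

For a four-word reference, a three-word candidate receives exp(1 - 4/3), roughly 0.72, and a five-word candidate receives exp(1 - 5/4), roughly 0.78, so both deviations lower the score.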

Furthermore, this work introduces another penalty for the difference between the positions of
n-grams, where candidate sentences are penalized based on comparing word positions with
respect to the reference sentence. If all the word positions are the same, no penalty is applied, and the score
is 1. The score varies between 0 and 1. When no positions match at all, the maximum penalty (a score of
0) is applied. Equation (3) is used to compute the position-based penalty.


$$npd\_score = \exp(-|PD|) \tag{3}$$


where PD denotes the position difference and npd_score represents the position difference score. The position
difference penalty is built on the length penalty, and the two are directly proportional: an increase in the
length penalty leads to an increase in the PD penalty. The proposed method also computes the common words
score as the ratio of the word overlap between the reference and candidate sentences (i.e., Nc) to the
total number of unique words (i.e., Tw), as shown in (4).


Cw_score = Nc/Tw (4) 
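
A minimal sketch of the two scores in (3) and (4) is given below. Several details are assumptions, since the text does not fix them: Tw is taken as the number of unique words in the union of both sentences, PD is taken as the mean absolute position difference of shared words normalized by the longer sentence length, the exponent is taken as negative so that the score stays in (0, 1] as described, and sentences with no shared word receive 0.

```python
import math


def common_word_score(candidate, reference):
    """Equation (4): shared unique words (Nc) over all unique words (Tw, assumed to be the union)."""
    cand_set, ref_set = set(candidate), set(reference)
    total_unique = len(cand_set | ref_set)
    if total_unique == 0:
        return 0.0
    return len(cand_set & ref_set) / total_unique


def position_difference_score(candidate, reference):
    """Equation (3) under an assumed definition of PD (see the lead-in)."""
    shared = set(candidate) & set(reference)
    if not shared:
        return 0.0  # assumption: nothing matches, so the maximum penalty applies
    longest = max(len(candidate), len(reference))
    # First-occurrence positions are compared; differences are normalized to [0, 1).
    total_shift = sum(abs(candidate.index(w) - reference.index(w)) for w in shared)
    pd = total_shift / (len(shared) * longest)
    return math.exp(-abs(pd))
```

With these choices, a candidate whose shared words sit at the same positions as in the reference gets a position score of 1, and any reordering pushes the score below 1.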


Part-of-speech (POS) tags capture the syntactic relation between two sentences. The POS tag score
for Arabic, defined in [36], is added to the proposed method to compare the syntactic structure of the
translated sentence with that of the reference sentence. Arabic is a linguistically rich language, so the same
context can be expressed in different ways. Therefore, when the candidate sentence is
contextually similar to the reference sentence, but some of its words differ from the reference at certain
positions while being synonyms, a position penalty results. Thus, POS tags are used in this study to
normalize that penalty term. The POS tags help the model moderate the penalty term by comparing the
candidate's tags with the POS tags of the reference sentence. Finally, the length penalty, position difference
penalty, common words score, and POS score are aggregated, as shown in Figure 1. The final score is computed
using (5). Table 2 shows some samples used in the proposed evaluation model.


Final_score = LP * npd_score * cw_score * pos_accuracy (5) 
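
The aggregation in (5) is a plain product of the four components. The sketch below assumes each component has already been computed and lies in [0, 1]; `pos_accuracy` is a hypothetical definition (fraction of positions whose POS tags agree), since the text refers to [36] for the actual POS score.

```python
def pos_accuracy(candidate_tags, reference_tags):
    """Hypothetical POS score: share of positions whose tags agree, over the longer length."""
    longest = max(len(candidate_tags), len(reference_tags))
    if longest == 0:
        return 0.0
    matches = sum(c == r for c, r in zip(candidate_tags, reference_tags))
    return matches / longest


def final_score(lp, npd_score, cw_score, pos_acc):
    """Equation (5): product of length penalty, position score, common word score, POS score."""
    components = {"lp": lp, "npd_score": npd_score, "cw_score": cw_score, "pos_acc": pos_acc}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return lp * npd_score * cw_score * pos_acc
```

Because the components are multiplied, a single low component (for example a poor POS agreement) pulls the final score down even when the others are close to 1.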

Figure 1. Model architecture (diagram blocks: LP, POS score, cw_score, pooling (mean, max, scaled), final score)


Table 2. Results of LP, position difference and POS tags model (columns: Reference Sentence, Candidate Sentence, Final score; English glosses visible in the samples: "The plane is flying", "The plane was flying", "The plane is flying very fast", "brokers are working", "brokers deal in", "ship sailing")


4. BERT BASED SENTENCE SIMILARITY 

In 2018, Devlin et al. [37] introduced bidirectional encoder representations from transformers
(BERT) to compute context-based word vectors using the encoder part of the Transformer [38], an
attention-based neural MT model. It produces word vectors that depend on the context in which the word is
used. Masked language modeling (MLM) and next sentence prediction (NSP) are the two pre-training tasks of
BERT, which help the model learn good syntactic and semantic language representations.

The BERT model has set the state of the art for many NLP tasks, which makes it important to utilize it
in this evaluation as well. A considerable amount of data and computation power is required for language
model pre-training. In this study, the language model could not be trained from scratch because of the lack of
computing resources and data. Instead, the distilled multilingual BERT model, which supports over 100
languages, is used. This model does not have high accuracy for low-resource languages such as Arabic and
Sindhi [39]. To overcome that, the multilingual BERT model is fine-tuned on a corpus of 1 million clean
Arabic sentences. The data is scraped from different Arabic news resources and blogs, and
spaCy is used for the necessary pre-processing of the data. The data is normalized to remove diacritics,
unnecessary HTML tags, and URLs. Moreover, the sentence tokenizer available in spaCy is used to split
documents into sentences.
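
The normalization step described above (removing diacritics, HTML tags, and URLs) can be sketched with the Python standard library. The regular expressions and the function name are assumptions for illustration, not the pipeline used in the study; in particular, the diacritic range below covers the common Arabic short-vowel marks only.

```python
import re

# Arabic diacritic marks (tashkeel), U+064B to U+0652, plus the superscript alef U+0670.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"(?:https?://|www\.)\S+")


def normalize_arabic(text):
    """Strip HTML tags, URLs, and Arabic diacritics, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)   # tags first, so URLs inside attributes vanish with them
    text = URL.sub(" ", text)
    text = ARABIC_DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Sentence splitting would follow this step; the study uses the sentence tokenizer available in spaCy for that purpose.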

Once pre-training is complete, the model can be fine-tuned on a custom task such as sentiment analysis,
POS tagging, or named entity recognition (NER). Since the goal here is to compute the similarity between
candidate and reference sentences, the data is hand-curated for this task. Arabic linguistic experts validated
the data for the sentence similarity task by looking at the syntactic and semantic features of the sentences.
A total of 10,000 cleaned Arabic sentences are collected and tweaked so that their context remains the same
but is expressed with different words or helping verbs. Moreover, 10,000 different sentences are also
selected, and random sentences are paired with them. The slightly tweaked sentence pairs are labelled similar,
and the random pairs are labelled not similar, as shown in Table 3. This fine-tuning used a Siamese triple
loss network. The architecture of the Siamese network is depicted in Figure 2.


Table 3. Sentence similarity dataset 


Sentence1 (English gloss) | Sentence2 (English gloss) | Label
was running | he runs | Similar
he's a doctor | The truth is hard | Not Similar

This architecture computes the Softmax loss for both the labels, which are similar and not similar, 
where u and v are the two vectors (each generated from candidate and reference sentence), fed to the 
similarity algorithm to calculate the similarity between the two. After that, the proposed model is fine-tuned 
on the semantic textual similarity (STS) dataset [40]. This dataset is freely available on the Stanford text 
similarity benchmark dataset. For this study, the dataset is translated from English to Arabic and the wrong 
translations are manually corrected with the help of linguistic experts. This dataset is available for regression 
tasks, and the score is assigned from 0 to 5 for each sentence pair. 
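
The classification objective described above, in which the sentence vectors u and v are combined and passed to a softmax layer over the two labels, can be sketched in plain Python. The sketch assumes mean pooling over token vectors and the feature concatenation (u, v, |u-v|) shown in Figure 2; the weights and bias are hypothetical placeholders for parameters that training would learn.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into one fixed-size sentence vector (assumed pooling choice)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def pair_features(u, v):
    """Concatenate u, v and their element-wise absolute difference: (u, v, |u - v|)."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(u, v, weights, bias):
    """Return probabilities over the labels (e.g. similar, not similar) for one sentence pair."""
    features = pair_features(u, v)
    logits = [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(weights, bias)]
    return softmax(logits)
```

For d-dimensional sentence vectors the feature vector has 3d entries, so each row of `weights` must have length 3d, with one row per label.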

Next, the last Softmax layer is changed, and the model is fine-tuned on the sentence similarity task
using cosine similarity. The labels are normalized between -1 and 1 for computing the cosine score, where 1
refers to completely similar and -1 denotes the opposite. This finalizes the proposed model for computing
cosine similarity for Arabic sentences. The reported results are provided in Table 4.
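
The cosine score and the label normalization can be sketched as follows. The linear mapping from the 0 to 5 STS scale onto [-1, 1] is an assumption; the text states only the target range, not the exact transformation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rescale_label(score, low=0.0, high=5.0):
    """Map a gold similarity score in [low, high] linearly onto [-1, 1] (assumed mapping)."""
    if not low <= score <= high:
        raise ValueError("score outside the expected range")
    return 2.0 * (score - low) / (high - low) - 1.0
```

Under this mapping a gold score of 5 becomes 1 (completely similar), 0 becomes -1, and 2.5 becomes 0, so the targets span the same range as the cosine score they are regressed against.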


Table 4. BERT sentence similarity results

Candidate Sentence | Reference Sentence | Similarity
Moving from one place to another | Move from one place to another | 0.8
A dungeon for violent prisoners | Prison for violent prisoners | 0.82
Run at a moderately fast pace | Run at a fast pace instantly | 0.61


Figure 2. Siamese network for softmax (diagram blocks, bottom to top: two input sentences, BERT, pooling, vectors u and v, (u, v, |u-v|), Softmax classifier)


5. EXPERIMENT SETUP AND RESULTS 

All the experiments were performed on a machine with an NVIDIA GPU with 24 GB of memory
and 128 GB of RAM. The POS model training took 4 hours, whereas BERT fine-tuning for
200k steps was completed in five days. Moreover, the BERT fine-tuning on the sentence similarity task took 3
hours. The samples used in this research are extracted from 250 paragraphs collected from online sources that
provide bilingual corpora for the English-Arabic language pair. These are as follows:

a) Reverso, which provides a huge number of bilingual texts derived from real-life contexts; and 
b) The UN Parallel Corpus, particularly the English-Arabic texts. 

Since colloquial Arabic has more than 25 different dialects, the sample paragraphs used in this work
are taken from sources that adopt Modern Standard Arabic, which is widely used in official contexts. All
machine translations of the English sentences into Arabic are generated using the Google Translator service,
while the reference translations are taken from the sources mentioned above after review by a translation
expert. Results of the proposed model are compared against the state-of-the-art BLEU and METEOR metrics,
using the Pearson correlation as the performance evaluation parameter. The Pearson correlation coefficient is
given by (6):


$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} \tag{6}$$


where $\bar{x}$, $\bar{y}$ are the mean values and $s_x$, $s_y$ are the square roots of the variances.
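
The Pearson coefficient in (6) can be computed directly from its definition. The sketch below uses sample standard deviations with the (n - 1) denominator, matching the equation; the function name is illustrative.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equally long score lists, following equation (6)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations: square roots of the variances with an (n - 1) denominator.
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if s_x == 0.0 or s_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return co_deviation / ((n - 1) * s_x * s_y)
```

In the evaluation setting, x would hold the metric scores for a set of sentences and y the corresponding human scores; values near 1 indicate that the metric ranks and spaces the translations much as the human evaluators do.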

The reported results show that the proposed model outperforms the others, in terms of accuracy and
linguistic validation, when evaluating machine translation for English-Arabic sentence pairs.
The evaluation is performed using a corpus of 2,000 English and Arabic sentence pairs. Evaluation scores are
computed for all the methods, and these scores are correlated with the scores given by human
translators. A detailed comparison of the metrics is given in Table 5. The human evaluators compared the
machine-translated sentences against the ones provided by the human translators and assigned evaluation
scores. The average score over all the evaluators for the MT system was 0.6954. Among the automatic
MT evaluation metrics, POS + BERT (score: 0.6549) came closest to the human
evaluations of the accuracy of the MT system. The proposed POS + BERT system
outperformed the standard BLEU and METEOR metrics for machine translation evaluation in the case of the
English-Arabic language pair. Moreover, adequacy and fluency are checked for translated sentences against
the human-labeled sentences. English sentences and their corresponding Arabic translations were presented to
human experts, who were asked to evaluate these sentences on a 5-point scale (1: no sense,
2: non-acceptable, 3: acceptable, 4: good, 5: ideal). To ensure that human judgments are not biased towards a
particular sentence, the reference translations employed in the automatic evaluation were not disclosed to the
evaluators. These evaluations were gathered from three different native Arabic-speaking subjects.

The scores given by the subjects are averaged to obtain the overall human judgment for adequacy
(comprehensiveness) and for fluency (naturalness). The respective scales for adequacy are none: 1, little
meaning: 2, much meaning: 3, most meaning: 4, all meaning: 5; and for fluency are incomprehensible: 1,
disfluent: 2, non-native: 3, good: 4, flawless: 5. To assess the performance of the proposed method in
terms of adequacy and fluency, 40 paragraphs are chosen randomly and provided to human evaluators to
score based on fluency (naturalness) and adequacy (comprehensiveness). After that, the scores were plotted
against the BERT similarity scores and the proposed metric scores. Finally, Pearson's correlation coefficients
are calculated between the fluency/adequacy scores and the proposed metric with POS and with
POS+BERT. The obtained results are included in Table 6. The relevant graphs are shown in Figures 3 and 4.

The obtained results validated that the proposed method outperforms BLEU and METEOR with 
reference to capturing the Adequacy and Fluency in candidate sentences. The proposed method considers 
syntactic and semantic features which other metrics lack. Furthermore, the linguistic features are considered 
to compute the similarity of candidate sentences with reference sentences. In contrast, BLEU and METEOR 
do not consider the syntactic and semantic properties in an embedding space. 


Table 5. Results of correlation for different metrics: system-level score and correlations with human judgments 


Metric Scores Coeff. of corr. 
BLEU 0.5273 0.3929 
METEOR 0.5627 0.6142
Proposed with POS 0.6142 0.6280 
Proposed with BERT 0.6522 0.6842 
Proposed with POS+BERT 0.6549 0.6763 
Human 0.6154 - 


Table 6. Pearson correlation comparison with adequacy and fluency 


                 Pearson's correlation coefficient
Score            Proposed metric with POS    Proposed metric with POS+BERT
Fluency score    0.6933                      0.7211
Adequacy score   0.6824                      0.7123

Figure 3. Adequacy score of proposed metric vs human evaluator and proposed metric with BERT similarity (two scatter plots titled "Human evaluator (adequacy score)"; x-axes: proposed metric score and proposed metric with BERT similarity, from 0 to 1)


Figure 4. Fluency score of proposed metric vs human evaluator and proposed metric with BERT similarity (two scatter plots titled "Human evaluator (fluency score)"; y-axis: human evaluated score; x-axis: proposed metric with BERT similarity, from 0 to 1)


6. CONCLUSIONS AND FUTURE WORK 

Despite the numerous challenges associated with the evaluation of MT into low-resource and
morphologically rich languages, these concerns are not well addressed in the current literature. This study is
believed to set a new direction in machine translation evaluation for morphologically rich languages like
Arabic. In this work, a context-based approach is utilized that captures the semantics of the language
and the similarity between machine translation and human translation. The experimental results
demonstrate that the proposed method surpasses the previous state of the art in English-to-Arabic translation
evaluation. Consequently, the proposed method encourages the utilization of automatic and semantic-based
methods for machine translation evaluation. However, the integration of more linguistic features and semantic
embeddings with the proposed method can be explored in the future to improve the MT evaluation system.